Vision transformer adapter-based hyperbolic embeddings for multi-lesion segmentation in diabetic retinopathy

Diabetic Retinopathy (DR) is a major cause of blindness worldwide. Early detection and treatment are crucial to prevent vision loss, making accurate and timely diagnosis critical. Deep learning technology has shown promise in the automated diagnosis of DR, and in particular, multi-lesion segmentation tasks. In this paper, we propose a novel Transformer-based model for DR segmentation that incorporates hyperbolic embeddings and a spatial prior module. The proposed model is primarily built on a traditional Vision Transformer encoder and further enhanced by incorporating a spatial prior module for image convolution and feature continuity, followed by feature interaction processing using the spatial feature injector and extractor. Hyperbolic embeddings are used to classify feature matrices from the model at the pixel level. We evaluated the proposed model’s performance on the publicly available datasets and compared it with other widely used DR segmentation models. The results show that our model outperforms these widely used DR segmentation models. The incorporation of hyperbolic embeddings and a spatial prior module into the Vision Transformer-based model significantly improves the accuracy of DR segmentation. The hyperbolic embeddings enable us to better capture the underlying geometric structure of the feature matrices, which is important for accurate segmentation. The spatial prior module improves the continuity of the features and helps to better distinguish between lesions and normal tissues. Overall, our proposed model has potential for clinical use in automated DR diagnosis, improving accuracy and speed of diagnosis. Our study shows that the integration of hyperbolic embeddings and a spatial prior module with a Vision Transformer-based model improves the performance of DR segmentation models. Future research can explore the application of our model to other medical imaging tasks, as well as further optimization and validation in real-world clinical settings.

Recently, Transformer-based models have begun to surpass convolutional neural networks, providing state-of-the-art performance in natural image classification, detection, and segmentation 18. However, their application to medical image processing still requires further research 19. Researchers have pointed out that hyperbolic space may exhibit superior performance compared to traditional neural networks based on Euclidean space 20. Therefore, hyperbolic embeddings are introduced in this paper to conduct pixel-level classification, and the experimental results demonstrate that hyperbolic space can enhance the performance of the model. The contributions of this work are as follows:
• We propose a Transformer-based model named "VTA (Vision Transformer Adapter) + HBE (Hyperbolic Embeddings)" suitable for DR multi-lesion segmentation tasks. Our approach adapts the original Vision Transformer model by incorporating a spatial prior module, which leverages convolutional neural networks to extract image features. The architecture also incorporates the spatial feature injector and extractor to enhance feature interaction.
• We employ hyperbolic embeddings to classify the feature representation at the pixel level. This solution addresses the challenges encountered in current implementations of hyperbolic multinomial logistic regression, resulting in a more efficient parallel pixel classification method for image segmentation tasks.
• The proposed model demonstrates superior performance in the segmentation of hard exudates and microaneurysms on the IDRiD dataset, and exceptional performance in the segmentation of microaneurysms and soft exudates on the DDR dataset.

Related work
Deep neural networks in DR lesion segmentation. Despite Deep Neural Networks (DNNs) 21,22 being prevalent in the segmentation of DR lesions, they still face two major challenges: significant morphological variations of DR lesions across different degrees and difficulty distinguishing DR lesions from similar structures 23 . The challenge in utilizing deep learning models for detecting small-size DR lesions lies in their ability to differentiate between normal and abnormal features. The limited size of these lesions, sometimes consisting of only a few pixels, can significantly impact the model's accuracy and hinder its ability to make reliable predictions.
Researchers have put forth numerous proposed enhancements and improvements to address the aforementioned challenges. Zhang et al. 24 present a feature fusion algorithm that leverages a multilayer attention mechanism for improved feature layer and channel fusion. The proposed method allows for more accurate selection of feature layers containing small target features, leading to improved preliminary detection of small targets. The findings of the study highlight the efficiency of the algorithm, resulting in a noteworthy increase in the average accuracy and sensitivity of microaneurysms detection. Wang et al. 25 introduced a semi-supervised collaborative learning model to enhance the precision of DR grading and lesion segmentation. The model leverages attention mechanism technology and utilizes low-level guidance to identify lesion features and high-level guidance to create lesion attention images. These attention images serve as pseudo masks to facilitate the training process of the segmentation model. Xie et al. 22 propose a novel and versatile framework to improve the precision of existing deep convolutional neural networks for medical image segmentation (including DR multi-lesion segmentation).

Hyperbolic deep learning.
DNNs are characterized by their multi-layer hybrid structure and multiple residual connections, which in theory allow them to model complex functions and have led to their dominance in research areas such as image classification and segmentation. Neural architectures based on Euclidean space are optimized primarily for raster data, which limits their ability to handle optimization problems involving structured data in non-Euclidean spaces. Their reliance on local proximity can result in an incorrect representation of geometric structures and undermine the effectiveness of these architectures for such tasks. Hyperbolic deep learning has gained widespread attention for its ability to effectively represent tree-like structures and taxonomies 26,27, text 28,29, and graph data 30,31. Researchers have put forth a number of hyperbolic alternatives for network layers, ranging from intermediate layers to classification layers 20,32,33.
A recent study 34 has extended the use of hyperbolic space to semantic image segmentation. The authors reformulated hyperbolic multinomial logistic regression to make it tractable. Their findings indicate that semantic image segmentation can benefit from hyperbolic space, for example through enhanced zero-shot generalization and improved performance with low-dimensional embeddings. Ganea et al. 32 bridge the gap between hyperbolic and Euclidean geometry in neural networks and deep learning, opening new possibilities in Geometric Deep Learning (GDL). They achieve this by generalizing basic operations, multinomial logistic regression, feed-forward, and gated recurrent neural networks to the Poincaré model of hyperbolic geometry using gyrovector spaces and generalized Möbius transformations. They introduce a unified framework that smoothly parametrizes basic operations and objects in constant negative curvature spaces and demonstrate how Euclidean and hyperbolic spaces can be transformed into each other. The effectiveness of hyperbolic neural network layers is demonstrated through experiments on textual entailment and noisy-prefix recognition tasks.
Transformer in medical images. Vaswani et al. 35 introduced the Transformer architecture, a design that features encoders and decoders as its fundamental components. The encoders employ attention mechanisms to consolidate information from input sequences into high-dimensional representations, while the decoders use these representations to generate target sequences. Since their inception, Transformer-based architectures have established a strong track record of delivering state-of-the-art performance on a variety of natural language processing and computer vision tasks. The success of Transformers is attributed to their high parallelizability, which allows for efficient training on large datasets and fast inference during deployment, and to their ability to capture context effectively 36. Their impact continues to be significant, with ongoing research aimed at further improving their performance and exploring their applications in new domains.
Recently, the inception of the Vision Transformer model has sparked a trend in the medical image processing community towards the adoption of Transformer-based architectures or hybrid convolutional neural networks to enhance model performance 19 . Shen et al. 37 proposed a novel convolution-and-transformer network, which is built on the encoder-decoder architecture and demonstrates efficacy in the segmentation of kidney cysts. Wang et al. 38 were pioneers in applying Transformer-based architecture for the efficient 3D segmentation of brain tumors. Their network leverages the encoder to extract volumetric spatial features, which are then transmitted to the decoder for upsampling and the generation of a full-resolution segmentation map. Yun et al. 39 proposed a novel Spectral Transformer model for the segmentation of hyperspectral pathology images. The model leverages a sequence-to-sequence prediction procedure to facilitate the learning of contextual features from spectral bands, thus enabling more accurate and efficient segmentation. Despite their remarkable performance in various applications, there remains potential for further exploration and development of Transformers in the field of medical imaging.

Hyperbolic space
Poincaré ball model. The Poincaré ball model, named after the mathematician Henri Poincaré, is a mathematical representation of hyperbolic geometry. It allows Euclidean concepts such as distance and angle to be used to reason about hyperbolic spaces, making it an effective way to study the properties of hyperbolic geometry that differ from those of Euclidean geometry. In addition to its utility in visualization, the Poincaré ball model allows standard Euclidean algorithms to be applied to geometric calculations in the hyperbolic plane, making it computationally practical. One of the primary applications of the Poincaré ball model is in computer graphics, particularly for representing the conformal structure of surfaces and analyzing complex datasets. The model allows the shape and curvature of a surface to be depicted without regard to its size or position, making it a useful tool for analyzing and manipulating complex datasets in this field. The Poincaré ball model has also received growing attention in the machine learning and data mining fields due to its ability to facilitate the analysis and manipulation of large and complex datasets 33. Eli et al. 40 focus on the Poincaré ball model and use a tangent space formalization to express classification problems; the proposed algorithms provably converge and are highly scalable, as their complexities are comparable to those of their Euclidean counterparts. They demonstrate superior accuracy on complex synthetic datasets and real-world datasets. Guo et al. 41 proposed a Poincaré-based heterogeneous graph neural network for sequential recommendation, which models both sequential pattern information and hierarchical information.
The Poincaré ball model is a Riemannian manifold $(\mathbb{B}^n_c, g^{\mathbb{B}}_c)$, where $\mathbb{B}^n_c = \{u \in \mathbb{R}^n : \sqrt{c}\,\|u\| < 1\}$ is the open ball of radius $1/\sqrt{c}$ in $n$-dimensional Euclidean space and $g^{\mathbb{B}}_c$ is the Riemannian metric, defined for tangent vectors $x, y$ at a point $u$ as:
$$g^{\mathbb{B}}_c(x, y) = \left(\frac{2}{1 - c\|u\|^2}\right)^2 \langle x, y\rangle, \tag{1}$$
where $\|\cdot\|$ is the $\ell_2$ norm and $\langle\cdot,\cdot\rangle$ is the standard inner product. The curvature of the Poincaré ball model is determined by the value of $c$. When $c = 0$, the Poincaré ball reduces to Euclidean space, i.e. $\mathbb{B}^n_c = \mathbb{R}^n$. In this case, the Riemannian metric becomes a constant multiple of the standard inner product, and the model represents the familiar geometry of flat spaces. For $c > 0$, the Poincaré ball model represents hyperbolic geometry, in which both the curvature and the radius are determined by the value of $c$. The Poincaré ball has two fundamental operations: Möbius addition and Möbius scalar multiplication. These correspond to vector addition and scalar multiplication in Euclidean spaces, respectively. Möbius addition is a non-commutative and non-associative operation that extends vector addition to the Poincaré ball, while Möbius scalar multiplication extends Euclidean scalar multiplication to the Poincaré ball. The Möbius addition $\oplus_c$ of $u, v \in \mathbb{B}^n_c$ is defined as:
$$u \oplus_c v = \frac{(1 + 2c\langle u, v\rangle + c\|v\|^2)\,u + (1 - c\|u\|^2)\,v}{1 + 2c\langle u, v\rangle + c^2\|u\|^2\|v\|^2}. \tag{2}$$
The operation $\oplus_c$ satisfies the equalities $u \oplus_c 0 = 0 \oplus_c u = u$ and $(-u) \oplus_c u = u \oplus_c (-u) = 0$. Furthermore, it recovers Euclidean addition when $c$ approaches zero, i.e. $c \to 0 \Rightarrow u \oplus_c v \to u + v$. The Möbius scalar multiplication $\otimes_c$ of a vector $u \in \mathbb{B}^n_c \setminus \{0\}$ by a scalar $v \in \mathbb{R}$ is defined as:
$$v \otimes_c u = \frac{1}{\sqrt{c}} \tanh\left(v \tanh^{-1}(\sqrt{c}\,\|u\|)\right) \frac{u}{\|u\|}, \tag{3}$$
and $v \otimes_c 0 = 0$. The Möbius scalar multiplication converges to the standard Euclidean scalar multiplication as the parameter $c$ approaches zero, i.e. $c \to 0 \Rightarrow v \otimes_c u \to vu$.
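The two operations above can be checked numerically with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the function names (`mobius_add`, `mobius_scalar_mul`), the scalar name `r`, and the sample vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(u, v, c):
    """Möbius addition u (+)_c v on the Poincaré ball with parameter c."""
    uv, uu, vv = dot(u, v), dot(u, u), dot(v, v)
    den = 1 + 2 * c * uv + c * c * uu * vv
    a = (1 + 2 * c * uv + c * vv) / den   # weight applied to u
    b = (1 - c * uu) / den                # weight applied to v
    return [a * x + b * y for x, y in zip(u, v)]

def mobius_scalar_mul(r, u, c):
    """Möbius scalar multiplication r (x)_c u; the zero vector maps to zero."""
    n = norm(u)
    if n == 0.0:
        return [0.0] * len(u)
    scale = math.tanh(r * math.atanh(math.sqrt(c) * n)) / (math.sqrt(c) * n)
    return [scale * x for x in u]

# Illustrative points inside the unit ball (c = 1).
u, v, c = [0.3, -0.2], [0.1, 0.4], 1.0
neg_u = [-x for x in u]

print(mobius_add(u, [0.0, 0.0], c))                # equals u: zero is the identity
print(mobius_add(neg_u, u, c))                     # numerically zero: -u is the inverse of u
print(mobius_add(u, v, c), mobius_add(v, u, c))    # two different points: not commutative
print(mobius_add(u, v, 1e-9))                      # close to the Euclidean sum u + v
print(mobius_scalar_mul(2.0, u, c), mobius_add(u, u, c))  # 2 (x)_c u agrees with u (+)_c u
```

The last line illustrates why the scalar multiplication is defined through `tanh`: multiplying by 2 gives the same point as adding `u` to itself with Möbius addition.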
The distance between $u, v \in \mathbb{B}^n_c$ in the Poincaré model is given by:
$$d_c(u, v) = \frac{2}{\sqrt{c}} \tanh^{-1}\left(\sqrt{c}\,\|{-u} \oplus_c v\|\right). \tag{4}$$
Segmentation networks are typically designed to operate in Euclidean space. To perform segmentation within the Poincaré ball, a mapping from the Euclidean tangent space to the hyperbolic space is required. One way to achieve this is the exponential map, which projects a Euclidean vector onto the Poincaré ball with a fixed anchor point. For $p \in \mathbb{D}^n_c$ (the same Poincaré ball, written $\mathbb{D}^n_c$ in the remainder of this section), the exponential map $\exp_p : T_p\mathbb{D}^n_c \to \mathbb{D}^n_c$ is given by:
$$\exp_p(v) = p \oplus_c \left(\tanh\left(\frac{\sqrt{c}\,\lambda^c_p\,\|v\|}{2}\right) \frac{v}{\sqrt{c}\,\|v\|}\right), \qquad \lambda^c_p = \frac{2}{1 - c\|p\|^2}. \tag{5}$$
The exponential map takes a tangent vector at a point of a manifold to a point on the manifold itself. This projection allows the segmentation network to operate in the Poincaré ball while maintaining the geometric properties of the hyperbolic space. The exponential map and its inverse, the logarithmic map, have more appealing forms when $p = 0$, namely for $v \in T_0\mathbb{D}^n_c \setminus \{0\}$ and $y \in \mathbb{D}^n_c \setminus \{0\}$:
$$\exp_0(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}, \qquad \log_0(y) = \tanh^{-1}(\sqrt{c}\,\|y\|)\,\frac{y}{\sqrt{c}\,\|y\|}. \tag{6}$$
Hyperbolic embeddings. Image segmentation is the task of assigning a label to each pixel in an input image. The input RGB image is represented as $X \in \mathbb{R}^{w \times h \times 3}$, where $w$ and $h$ are the image's width and height, respectively. A function $f(X) : \mathbb{R}^{w \times h \times 3} \to \mathbb{R}^{w \times h \times n}$ transforms each pixel in the input image to an $n$-dimensional representation, giving a matrix $Y \in \mathbb{R}^{w \times h \times n}$. A popular technique among contemporary methods is to classify all pixels simultaneously by passing them through a linear layer and applying the softmax function to generate a $C$-dimensional probability distribution for each pixel across all $C$ classes, i.e. $f(Y) : \mathbb{R}^{w \times h \times n} \to \mathbb{R}^{w \times h \times C}$. This approach is typically optimized with cross-entropy as the objective function. The pipeline processes all pixels in parallel, which maximizes efficiency while the model is optimized by minimizing the cross-entropy loss.
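To make the mapping from a Euclidean per-pixel feature onto the ball concrete, the following sketch implements the exponential and logarithmic maps at the origin and the distance from the origin. It is an illustration under assumptions: the function names, the value `c = 0.5`, and the sample feature vector are hypothetical.

```python
import math

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def exp0(v, c):
    """Exponential map at the origin: tangent (Euclidean) vector -> point on the ball."""
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]

def log0(y, c):
    """Logarithmic map at the origin: point on the ball -> tangent vector."""
    n = norm(y)
    if n == 0.0:
        return [0.0] * len(y)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in y]

def dist_from_origin(y, c):
    """Hyperbolic distance d_c(0, y); with u = 0 the Möbius sum -u (+)_c y is just y."""
    return 2 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(y))

c = 0.5
feature = [0.3, -0.4, 1.2]            # hypothetical network output for one pixel, norm 1.3
z = exp0(feature, c)

print(math.sqrt(c) * norm(z) < 1)     # True: z lies inside the ball of radius 1/sqrt(c)
print(log0(z, c))                     # recovers the original feature up to rounding
print(dist_from_origin(z, c))         # equals 2 * ||feature|| up to rounding
```

For feature vectors with a very large norm, `tanh` saturates to 1.0 in floating point and the point lands numerically on the boundary of the ball; practical implementations usually clip the norm slightly below $1/\sqrt{c}$, which this sketch omits.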
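The purpose of this study is to examine the application of hyperbolic space to pixel-level classification for image segmentation. Following the geometric interpretation of hyperbolic multinomial logistic regression 42, a gyroplane is a hyperplane within the Poincaré ball. Specifically, for $p \in \mathbb{D}^n_c$ and $a \in T_p\mathbb{D}^n_c \setminus \{0\}$, the Poincaré hyperplane is defined as:
$$H^c_{a,p} = \{ z_{ij} \in \mathbb{D}^n_c : \langle -p \oplus_c z_{ij}, a\rangle = 0 \}, \tag{7}$$
where $z_{ij} = \exp_0(f(X)_{ij})$ denotes the result of applying the exponential map to the neural network output at pixel location $(i, j)$, $p \in \mathbb{D}^n_c$ is the reference point, and $a \in T_p\mathbb{D}^n_c$ is the normal vector of the gyroplane. The set $H^c_{a,p}$ can also be defined as the union of all images of geodesics in the hyperbolic space $\mathbb{D}^n_c$ that are orthogonal to the vector $a$ and pass through the point $p$ 27. The hyperbolic distance between the point $z_{ij}$ and the gyroplane $H^c_y$ of class $y$, with reference point $p_y$ and normal vector $a_y$, is computed as:
$$d(z_{ij}, H^c_y) = \frac{1}{\sqrt{c}} \sinh^{-1}\left(\frac{2\sqrt{c}\,|\langle -p_y \oplus_c z_{ij}, a_y\rangle|}{(1 - c\|{-p_y} \oplus_c z_{ij}\|^2)\,\|a_y\|}\right). \tag{8}$$
Figure 1 visualizes the hyperbolic gyroplane $H^c_y$ and its distance to the output $z_{ij}$ on the manifold.
Approximate treatment in hyperbolic space. Image segmentation requires simultaneous per-pixel classification. However, current implementations of hyperbolic multinomial logistic regression are computationally infeasible at this scale, because the Möbius addition must be evaluated explicitly for every pixel and every class. To reduce the memory footprint, we use an alternative computation of the margin likelihood that does not require the explicit Möbius addition. We rewrite the inner product in the numerator and the squared norm in the denominator of Eq. (8). By the definition of Möbius addition, $p_y \oplus_c z_{ij}$ is a linear combination $A p_y + B z_{ij}$, so the inner product becomes:
$$\langle p_y \oplus_c z_{ij}, a_y\rangle = \langle A p_y + B z_{ij}, a_y\rangle = A\langle p_y, a_y\rangle + B\langle z_{ij}, a_y\rangle, \tag{9}$$
with
$$A = \frac{1 + 2c\langle p_y, z_{ij}\rangle + c\|z_{ij}\|^2}{1 + 2c\langle p_y, z_{ij}\rangle + c^2\|p_y\|^2\|z_{ij}\|^2}, \qquad B = \frac{1 - c\|p_y\|^2}{1 + 2c\langle p_y, z_{ij}\rangle + c^2\|p_y\|^2\|z_{ij}\|^2}.$$
The squared norm of the Möbius addition can be computed efficiently from the same coefficients:
$$\|p_y \oplus_c z_{ij}\|^2 = A^2\|p_y\|^2 + 2AB\langle p_y, z_{ij}\rangle + B^2\|z_{ij}\|^2. \tag{10}$$
The terms involving $-p_y \oplus_c z_{ij}$ in Eq. (8) are obtained by substituting $-p_y$ for $p_y$ in Eqs. (9) and (10). Consequently, following the hyperbolic multinomial logistic regression formulation 42, the logit of each pixel for class $y$ is computed as:
$$\zeta_y(z_{ij}) = \frac{\lambda^c_{p_y}\|a_y\|}{\sqrt{c}} \sinh^{-1}\left(\frac{2\sqrt{c}\,\langle -p_y \oplus_c z_{ij}, a_y\rangle}{(1 - c\|{-p_y} \oplus_c z_{ij}\|^2)\,\|a_y\|}\right), \qquad \lambda^c_{p_y} = \frac{2}{1 - c\|p_y\|^2}. \tag{11}$$
The logits are optimized with the cross-entropy loss function and gradient descent.
Our method employs an alternative computation of the inner product and squared norm in the calculation of class logits, which makes hyperbolic pixel-level classification feasible. This approach addresses the intractability encountered in current implementations of hyperbolic multinomial logistic regression and enables more efficient per-pixel classification in parallel for image segmentation tasks.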
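As a sanity check on the alternative computation described above, the sketch below evaluates one class logit in two ways: with an explicit Möbius addition, and from inner products only. It is a minimal illustration rather than the authors' code. The logit formula used is the standard hyperbolic multinomial logistic regression form, and all function names, offsets, normals, and the pixel embedding are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mobius_add(u, v, c):
    """Explicit Möbius addition u (+)_c v."""
    uv, uu, vv = dot(u, v), dot(u, u), dot(v, v)
    den = 1 + 2 * c * uv + c * c * uu * vv
    return [((1 + 2 * c * uv + c * vv) * x + (1 - c * uu) * y) / den
            for x, y in zip(u, v)]

def logit_from_terms(inner, sq_norm, p, a, c):
    """Class logit from <-p (+)_c z, a> and ||-p (+)_c z||^2."""
    a_norm = math.sqrt(dot(a, a))
    lam = 2 / (1 - c * dot(p, p))               # conformal factor at p
    arg = 2 * math.sqrt(c) * inner / ((1 - c * sq_norm) * a_norm)
    return lam * a_norm / math.sqrt(c) * math.asinh(arg)

def logit_explicit(z, p, a, c):
    """Reference version: forms the vector -p (+)_c z explicitly."""
    m = mobius_add([-x for x in p], z, c)
    return logit_from_terms(dot(m, a), dot(m, m), p, a, c)

def logit_fast(z, p, a, c):
    """Same logit from scalar inner products, without forming -p (+)_c z."""
    q = [-x for x in p]                          # substitute -p for p
    qz, qq, zz = dot(q, z), dot(q, q), dot(z, z)
    den = 1 + 2 * c * qz + c * c * qq * zz
    A = (1 + 2 * c * qz + c * zz) / den
    B = (1 - c * qq) / den
    inner = A * dot(q, a) + B * dot(z, a)
    sq_norm = A * A * qq + 2 * A * B * qz + B * B * zz
    return logit_from_terms(inner, sq_norm, p, a, c)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

c = 1.0
z = [0.2, 0.1]                                   # hypothetical pixel embedding on the ball
offsets = [[0.1, 0.0], [-0.3, 0.2]]              # hypothetical p_y for two classes
normals = [[1.0, 0.5], [-0.4, 0.9]]              # hypothetical a_y for two classes

explicit = [logit_explicit(z, p, a, c) for p, a in zip(offsets, normals)]
fast = [logit_fast(z, p, a, c) for p, a in zip(offsets, normals)]
probs = softmax(fast)
print(explicit, fast)                            # the two versions agree up to rounding
print(probs)                                     # class probabilities for this pixel
```

In a batched setting the benefit is that the quantities $\langle p_y, z_{ij}\rangle$, $\langle z_{ij}, a_y\rangle$ and $\|z_{ij}\|^2$ can be obtained with matrix products over all pixels, so no tensor of per-pixel, per-class Möbius sums needs to be stored.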
Vision transformer encoder. The ViT-Adapter design philosophy is a way of leveraging scalable NLP Transformer architectures for vision tasks 43. The benefit of this straightforward design is that these architectures and their efficient implementations can be used almost immediately. The ViT-Adapter has the potential to close the gap between the ViT 44 model and vision-specific models for segmentation tasks while maintaining the versatility of ViT, and it can benefit from advanced multi-modal pre-training techniques. The Transformer receives a 1D sequence of token embeddings as input. To adapt it to 2D medical images, we reshape the image $X \in \mathbb{R}^{w \times h \times 3}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot 3)}$, where $(P, P)$ is the resolution of each patch and $N = wh/P^2$ is the number of patches. The Transformer encoder then comprises sequential layers of Multi-Headed Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. The MSA block captures the relationships between different elements in the input sequence by computing multiple attention heads in parallel, where each head learns a different aspect of the input sequence. The MLP block applies a non-linear transformation to the output of the MSA block.
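The reshaping of an image into a patch sequence can be illustrated with a small pure-Python sketch. The helper names `num_patches` and `patchify`, the toy image, and the 512 × 512 input size are assumptions made for illustration only.

```python
def num_patches(w, h, p):
    """Number of non-overlapping p x p patches in a w x h image: N = w * h / p^2."""
    if w % p or h % p:
        raise ValueError("image size must be divisible by the patch size")
    return (w // p) * (h // p)

def patchify(image, p):
    """Split image[h][w][3] into a list of N flattened patches of length p * p * 3."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            flat = [ch
                    for r in range(top, top + p)
                    for col in range(left, left + p)
                    for ch in image[r][col]]
            patches.append(flat)
    return patches

# Toy image with w = 8, h = 4; each pixel stores (row, column, 0).
image = [[[r, col, 0] for col in range(8)] for r in range(4)]
patches = patchify(image, 2)

print(len(patches), len(patches[0]))            # 8 patches, each of length 2 * 2 * 3 = 12
print(num_patches(512, 512, 16), 16 * 16 * 3)   # 1024 patches, each of length 768
```

With a patch size of 16, each flattened RGB patch has 768 values, which matches the ViT width of 768 listed in the model configuration.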
Layer Normalization (LN) is applied before each sub-layer of the Transformer encoder (the MSA and MLP blocks), and a residual connection is added after each sub-layer. LN normalizes its input by subtracting the mean and dividing by the standard deviation, which improves the stability of the model during training and allows it to generalize better to new data. The residual connections help to mitigate the vanishing gradient problem and allow the model to learn deeper representations. The ViT encoder comprises a total of $L$ layers, which can be represented mathematically as:
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, L. \tag{12}$$
For image segmentation, we divide the encoder layers into uniform encoder blocks and feed the feature tokens from each encoder block into the vision transformer adapter module. The input image for the Transformer encoder is first processed through patch embeddings, where it is divided into non-overlapping 16 × 16 patches.

Vision transformer adapter.
Spatial prior module. Recent studies 45,46 have demonstrated that convolutions with overlapping sliding windows can enhance the ability of transformers to capture local continuity in input images. We add a Spatial Prior Module (SPM) to the Transformer encoder: a convolution-based structure that downsamples a $w \times h$ input image to various scales. The architecture of the SPM is shown in Fig. 3. The objective of this module is to model the local spatial contexts of images alongside the patch embeddings layer while preserving the original architecture of the Vision Transformer. We first feed the input image into the spatial prior module and obtain a feature pyramid $\{f_1, f_2, f_3\}$, which contains $D$-dimensional feature maps ($D = 1024$) with resolutions of 1/8, 1/16, and 1/32 of the input. Then we flatten and concatenate these feature maps into feature tokens $F^1_{sp} \in \mathbb{R}^{\left(\frac{HW}{8^2} + \frac{HW}{16^2} + \frac{HW}{32^2}\right) \times D}$ for feature interaction, where $H$ and $W$ denote the height and width of the input image.
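As a worked example of the size of the spatial prior tokens, assume a 512 × 512 input, i.e. $H = W = 512$ (an illustrative size, not one stated in the text). The three pyramid levels then contribute:

```latex
\frac{512^2}{8^2} + \frac{512^2}{16^2} + \frac{512^2}{32^2}
  = 4096 + 1024 + 256
  = 5376
\quad\Longrightarrow\quad
F^{1}_{sp} \in \mathbb{R}^{5376 \times 1024}.
```

Under this assumption, the 1/16-scale level has 1024 tokens, the same number as the 16 × 16 patch sequence of the ViT encoder, while the 1/8-scale level supplies four times as many tokens at finer resolution.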
Spatial feature injector. The columnar structure of ViT results in single-scale, low-resolution feature maps, which hurt its performance in segmentation tasks relative to pyramid-structured transformers. To address this, we use feature interaction modules, starting with the Spatial Feature Injector (SFI), to enhance communication between the adapter and ViT. We partition the transformer encoder of ViT into $N$ equal blocks, and $b_i$ denotes the feature from encoder block $i$ of the ViT model. To incorporate the spatial feature $F^i_{sp}$ into $b_i$, we employ multi-head cross-attention, which can be formulated as:
$$\hat{b}_i = b_i + \gamma_i\,\mathrm{Attention}\left(\mathrm{LN}(b_i), \mathrm{LN}(F^i_{sp})\right), \tag{13}$$
where $\gamma_i$ is a learnable parameter that balances the output of the attention layer against the feature $b_i$ and is initialized to zero. The pseudo-code of the spatial feature injector is provided in Algorithm 1.
Algorithm 1. Spatial feature injector.
    …
    Generate random feature maps with size 16 × 16 and depth 64;
    Generate random spatial features with size 16 × 16 and depth 16;
  end
  Initialize gamma parameters $\gamma_i$ to zero;
  Define multi-head cross-attention function;
  Compute attention using Eq. (13) for each transformer encoder block $i$;
  Apply LN operation to the result;
Spatial feature extractor. After the spatial feature has been incorporated via the SFI module, the encoder block produces the output feature $b_{i+1}$. A multi-scale feature extractor then enhances the spatial feature and extracts multi-scale features. We employ a cross-attention layer to facilitate communication between the output feature $b_{i+1}$ and the spatial feature $F^i_{sp}$. Subsequently, a Convolutional Feed-Forward Network (CFFN) is applied after the attention layer. Algorithm 2 shows the pseudo-code of the spatial feature extractor, and the process is formulated as:
$$\hat{F}^i_{sp} = F^i_{sp} + \mathrm{Attention}\left(\mathrm{LN}(F^i_{sp}), \mathrm{LN}(b_{i+1})\right), \qquad F^{i+1}_{sp} = \hat{F}^i_{sp} + \mathrm{CFFN}\left(\mathrm{LN}(\hat{F}^i_{sp})\right). \tag{14}$$
Model configurations. Our model is designed to convert each pixel into an $n$-dimensional representation. In our experiments, the patch size of the ViT model is fixed at 16 and the dimension $D$ is set to 1024. The number of interactions $N$ is set to 4, i.e. the ViT encoder layers are divided into 4 equal blocks for feature interaction. The width of ViT is set to 768, with a feed-forward network (FFN) size of 3072 and 12 heads. To reduce computational overhead, the ratio of the CFFN is set to 1/4, with a hidden size of 96, and the adapter has 12 heads.
Implementation details. Our framework is implemented in PyTorch and run on three NVIDIA GeForce RTX 3090 Ti GPUs, each equipped with 24 GB of memory. The newly integrated adapter modules are randomly initialized and do not use any pre-trained weights. The initial learning rate is set to 0.001. The model is trained for 160 epochs on the ADE20K dataset with a batch size of 2 and then fine-tuned on the IDRiD and DDR datasets. Euclidean parameters are optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a polynomial learning rate decay of power 0.9. Hyperbolic parameters are optimized using Riemannian Stochastic Gradient Descent (RSGD).
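The polynomial learning rate decay mentioned above is commonly written as $\mathrm{lr}_t = \mathrm{lr}_0\,(1 - t/T)^{\text{power}}$. The following sketch assumes this common form; the function name `poly_lr`, the total number of steps, and the choice to count optimizer steps rather than epochs are assumptions, since the text gives only the initial rate (0.001) and the power (0.9).

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power

total_steps = 1000   # hypothetical schedule length
schedule = [poly_lr(0.001, s, total_steps) for s in (0, 250, 500, 750, 1000)]
for s, lr in zip((0, 250, 500, 750, 1000), schedule):
    print(s, lr)     # starts at 0.001 and decays monotonically to 0.0
```

With a power slightly below 1 the curve stays close to a linear ramp but keeps the learning rate a little higher through most of training before dropping to zero at the end.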

Evaluation metrics.
To evaluate the proposed model, we use the Area Under the Curve (AUC) of the Precision-Recall (PR) curve and of the Receiver Operating Characteristic (ROC) curve. These metrics have been widely adopted in previous research and competitions on fundus image segmentation. AUC_PR summarizes the trade-off between precision and recall over all decision thresholds and therefore reflects how well the lesion (positive) pixels are predicted, while AUC_ROC summarizes the trade-off between the true positive rate and the false positive rate. Both AUC_PR and AUC_ROC characterize the overall performance of different neural models.
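Both metrics can be computed from per-pixel lesion scores and binary ground-truth labels. The sketch below is illustrative only: it uses the rank-based form of the ROC AUC and average precision as one common estimator of the area under the PR curve (the text does not state which estimator was used), and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the PR curve; assumes the scores are distinct."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank        # precision at the rank of each positive
    if tp == 0:
        raise ValueError("need at least one positive")
    return total / tp

# Hypothetical pixel labels (1 = lesion) and predicted lesion scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(roc_auc(labels, scores))            # 5 of the 9 positive/negative pairs are ordered correctly
print(average_precision(labels, scores))  # mean of the precisions 1/1, 2/3 and 3/6
```

Because lesion pixels are a small fraction of a fundus image, the PR-based score is usually the more demanding of the two: a large number of true negatives raises AUC_ROC but has no effect on precision or recall.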
Experimental results. We trained DeepLab v3+ 49, UNet 50, UNet++ 51 and Seg-B/16 52 models on the IDRiD and DDR datasets, with training parameters initialized according to the methods described in the corresponding original papers, and compared our VTA+HBE model with these popular models. Table 1 shows the comparison on the IDRiD test set: the VTA+HBE model achieved the highest performance in microaneurysm (MA) and soft exudate (SE) segmentation, while its performance in hard exudate (EX) and hemorrhage (HE) segmentation was comparable to that of Seg-B/16. Table 2 shows the comparison on the DDR test set: the VTA+HBE model achieved the best performance in MA and SE segmentation, but its EX and HE segmentation results were slightly inferior to those of the Seg-B/16 model.
Ablation studies. We employ the vision transformer with transposed convolution as our baseline, in which the feature sequence generated by each coding block is reshaped and then upsampled via transposed convolution. The resulting upsampled feature matrix is then reshaped and used as the input feature sequence for the next coding block. This baseline preserves the dimensions and size of each feature sequence in our model. We enhanced this baseline with our proposed VTA technique and named the resulting model "baseline+VTA". Further, we added hyperbolic embeddings (HBE) to classify the feature matrix at the pixel level, resulting in the "VTA + HBE" model. To demonstrate the efficacy of VTA and HBE, we randomly selected and visualized predictions from the IDRiD and DDR test sets. Figure 4 compares the segmentation results with the original images and ground truths. Segmentation plots of the baseline, baseline + VTA, and VTA + HBE models show the improvement contributed by each component of our network.
Figure 4. Visualization of the efficacy of the VTA and the HBE. We randomly selected and visualized predictions from the IDRiD and DDR test sets. The green and yellow boxes highlight the areas where DR segmentation was improved by the VTA and the HBE, respectively. As can be seen from the figure, the HBE enhances finer segmentation predictions, while the VTA prevents certain misclassification predictions.
The results for different curvature values are reported in Tables 5 and 6. As shown in the tables, the performance of the model improves to some extent as the curvature value increases, but it decreases rapidly when the curvature value is greater than 2.

Conclusions
The main contribution of this study is the introduction of a hyperbolic space-based Transformer model architecture for DR image segmentation. The VTA+HBE model adapts the original Vision Transformer model by adding a spatial prior module that leverages convolutional neural networks to extract image features. This allows the model to capture both spatial and semantic information, which is crucial for the accurate segmentation of lesions in DR. The VTA+HBE model also incorporates a spatial feature injector and extractor to improve feature interaction. The hyperbolic embeddings for pixel-level classification address the challenges faced by current implementations of hyperbolic multinomial logistic regression, resulting in more efficient and accurate segmentation of lesions in DR. Through extensive experimentation, we provide evidence of the efficacy of the proposed Transformer-based model architecture in extracting meaningful features from DR images, while also demonstrating the potential benefits of incorporating hyperbolic space within deep learning frameworks. Given the severe consequences of DR, including vision impairment and blindness, the research community has shown a growing interest in developing automated DR detection systems. The significance of this study therefore lies in its potential to aid in the accurate diagnosis of DR by automating the detection of DR lesions.
In future research, we will delve deeper into exploring the synergistic integration of hyperbolic space and diverse deep learning architectures, with a specific focus on their application in DR classification and segmentation tasks. We aim to advance the field by investigating novel techniques and methodologies that leverage the unique properties of hyperbolic space to enhance the performance, interpretability, and generalization capabilities of deep learning models when applied to DR classification or segmentation tasks.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.